Starting a Coffee Business in Toronto

An Analysis of the Coffee Business in Toronto Based on the Yelp Dataset

Haiyue Yang Feb 2020

Introduction

Suppose you would like to start a coffee business but have no experience. What would you do? You would probably ask a friend who owns a coffee shop for advice. But the experience of a single shop might not be enough, so let's dig into the Yelp dataset to uncover what makes a coffee shop popular.

In this report, we use data from the Yelp dataset (for academic use only; commercial use is not allowed) (Question 1.1) to analyse businesses and reviews overall and in Toronto. We focus mainly on coffee businesses, especially the two coffee giants Tim Horton's and Starbucks, and investigate how to start a coffee business in Toronto successfully.

About the data (Question 1.2)

business:

  • business_id: the unique id for each business
  • name: the name of the franchise
  • address: the address of the business
  • city: the city the business is in
  • state: the state the business is in
  • postal_code: the postal code of the business
  • latitude: the latitude of the location of the business
  • longitude: the longitude of the location of the business
  • stars: the stars rating of the business
  • review_count: the total number of reviews of the business
  • is_open: whether the business is open
  • attributes: the information of certain attributes of the business
  • categories: the categories that the business belongs to
  • hours: the opening hours of the business

review:

  • review_id: the unique id for the review
  • user_id: the unique id for the user who wrote the review
  • business_id: the unique id for the business the review is for
  • stars: the rating the reviewer gave
  • useful, funny, cool: the number of useful, funny, and cool votes the review received
  • text: the text of the review
  • date: the date of the review

Data Cleaning

Business Overall

First, we loaded the business data from the Yelp dataset into the dataframe df.

Then, we used the fuzzywuzzy library to measure the similarity of city name pairs. We unified any pair whose ratio, partial_ratio, or token_set_ratio was larger than 90, and stored the unified names in a new column modified_city.

For the remaining cities, which did not match any other city, we simply copied the original value of city to modified_city.
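
The unification step above can be sketched as follows. This is only an illustration: it uses difflib from the Python standard library as a stand-in for fuzzywuzzy and applies a single plain ratio rather than the three scores used in the report. The helper names similarity and unify_names and the sample city list are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> int:
    """Return a 0-100 similarity score (a difflib stand-in, not fuzzywuzzy itself)."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def unify_names(names, threshold=90):
    """Map each name to the first earlier name it matches above the threshold."""
    canonical = []  # representative spellings seen so far
    mapping = {}
    for name in names:
        match = next((c for c in canonical if similarity(name, c) > threshold), None)
        if match is None:
            canonical.append(name)  # no match: keep the original spelling
            match = name
        mapping[name] = match
    return mapping


# Hypothetical city spellings, for illustration only.
cities = ["Toronto", "Toronto ", "Las Vegas", "Las  Vegas", "Markham"]
modified_city = unify_names(cities)
```

Names that match nothing map to themselves, which mirrors copying city to modified_city for unmatched cities.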

There are 778 cities in the Yelp dataset in total.

We drew a word cloud of the 100 cities with the most businesses in the Yelp data. It gives us a glimpse of the popular cities that the Yelp dataset covers. (Question 2.1, more discussion later)

Next, we examined whether each business has bike parking by inspecting the dictionaries stored in the column attributes, and saved the result in a new column BikeParking.

After that, according to Wikipedia, the postal codes in the Greater Toronto Area start with M or L, so we flagged a business as being in the GTA if its postal_code starts with M or L, and saved the flag in a new column GTA.
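
A minimal sketch of these two flags, assuming each business is a plain dictionary whose attributes field is either missing or a dictionary with string values such as 'True'. The helper names has_bike_parking and in_gta and the sample rows are hypothetical.

```python
def has_bike_parking(attributes):
    """True only when the attributes dict records BikeParking as 'True'."""
    if not attributes:  # attributes can be missing (None)
        return False
    return str(attributes.get("BikeParking")) == "True"


def in_gta(postal_code):
    """Flag postal codes starting with M or L, the rule used in this report."""
    return bool(postal_code) and postal_code.strip().upper().startswith(("M", "L"))


# Hypothetical rows for illustration; not taken from the dataset.
businesses = [
    {"name": "A", "postal_code": "M5T 2E9", "attributes": {"BikeParking": "True"}},
    {"name": "B", "postal_code": "L3S", "attributes": None},
    {"name": "C", "postal_code": "89109", "attributes": {"BikeParking": "False"}},
]
for row in businesses:
    row["BikeParking"] = has_bike_parking(row["attributes"])
    row["GTA"] = in_gta(row["postal_code"])
```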

This is a glimpse of the data.

Out[117]:
    review_id               user_id                 business_id             stars  useful  funny  cool  text                                               date                 GTA   tim    star
6   G7XHMxG0bx9oBJNECG4IFg  jlu4CztcSxrKx56ba1a5AQ  3fw2X5bZYeW9xCz_zGhOHg  3.0    5       4      5     Tracy dessert had a big name in Hong Kong and ...  2016-05-07 01:21:02  True  False  False
14  JVcjMhlavKKn3UIt9p9OXA  TpyOT5E16YASd7EWjLQlrw  AakkkTuGZA2KBodKi2_u8A  1.0    1       1      0     I cannot believe how things have changed in 3 ...  2012-07-16 00:37:14  True  False  False
15  svK3nBU7Rk8VfGorlrN52A  NJlxGtouq06hhC7sS2ECYw  YvrylyuWgbP90RgMqZQVnQ  5.0    0       0      0     You can't really find anything wrong with this...  2017-04-07 21:27:49  True  False  False
19  4bUyL7lzoWzDZaJETAKREg  _N7Ndn29bpll_961oPeEfw  y-Iw6dZflNix4BdwIyTNGA  3.0    0       0      0     Good selection of classes of beers and mains. ...  2014-06-27 21:19:23  True  False  False
34  E6B-2U2sGG3xgmnNWZAEew  DbccYu3OppWKl21OanZnTg  YSUcHqlKMPHHJ_cTrqtNrA  1.0    0       0      0     Came here on a Thursday night at 6:30 p.m. My ...  2017-12-29 13:55:19  True  False  False

Business in the GTA

We extracted the data for the Greater Toronto Area (GTA) into a new dataframe GTA, using the indicator already stored in the column GTA, and saved it for later use.

This is a glimpse of the new dataframe.

Reviews in the GTA

First, we loaded the review data from the Yelp dataset into the dataframe reviews.

After that, we extracted the reviews in the GTA to the dataframe GTA_reviews by matching the business_id in the dataframe reviews to the business_id in the dataframe GTA.

This is a glimpse of the dataframe extracted.

Why Start a Coffee Business in Toronto?

Why Start a Business in Toronto?

Before starting a business, we want to make sure we are making the right decision, so we investigated why it is rational to start a coffee business in Toronto.

We grouped the overall dataframe by the modified city name modified_city and counted the total number of businesses in each city. The following bar plot shows the 10 cities with the most businesses.

Unsurprisingly, Las Vegas, a city famous for tourism, gaming, and conventions, which in turn feed the retail and restaurant industries, is the city with the largest number of businesses on Yelp. However, we noticed that our city, Toronto, the biggest city in Canada, has the second-largest number of businesses. Thus, if you do not want to move away from your home here and pay the extremely high rent in Las Vegas, Toronto might be a good place to start a business. (Question 2.1)

Why Start a Coffee Business?

Beyond the location of the new business, our city, Toronto, is flourishing in various industries, so we also have many choices for the category of our business. Therefore, let's take a closer look at the businesses in the Greater Toronto Area (GTA).

Firstly, we noticed that the same franchise may appear under several different names. For instance, the coffee shop Tim Horton's is sometimes written as Tim Hortons or simply Tim Horton. Thus, in order to count the businesses of the top franchises more accurately, we used the fuzzywuzzy library to find business names that match the 50 franchises with the most businesses in the GTA. If the match ratio was larger than 90, we assumed the names were different spellings of the same franchise and standardized them. The standardized names are stored in a new column modified_name.

Next, we grouped the GTA dataframe by the column modified_name, counted the total number of businesses of each franchise in the GTA, and visualized some of them.

The following plots show the 10 franchises with the most businesses in the GTA.

In both of the plots above, coffee franchises are marked in red.

We were surprised to discover that the two largest franchises in the GTA, Tim Hortons and Starbucks, are both coffee franchises, and that each has many more businesses than the third-largest franchise, McDonald's.

Moreover, 3 of the top 10 franchises in the GTA are coffee franchises, and these 3 franchises, Tim Hortons, Starbucks, and Second Cup, account for nearly half of the businesses of the top 10 franchises, showing the success of these coffee giants. (Question 3.2)

However, the success of these three coffee giants does not necessarily mean that the whole coffee industry is flourishing. Thus, we went on to investigate whether coffee is a popular category as a whole in the GTA.

Secondly, we extracted the columns categories and attributes from the dataframes df and GTA, stored them in new dataframes category and GTA_cate respectively, and split the categories of each business into separate rows. Then, we grouped category and GTA_cate by the split categories, counted the total number of businesses in each category overall and in the GTA respectively, calculated the percentage of each category, and stored the results in new dataframes category_count and GTA_cate_count. Next, we created a new dataframe top containing the top 10 categories in the GTA. The column Overall stores the percentage of each category in the overall dataset, and the column GTA stores the percentage of each category in the GTA.
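
The split-and-count step can be sketched as below. This is a sketch under assumptions: categories are taken to be comma-separated strings, and the percentage is computed as a share of all category mentions, which may differ from the denominator used in the report. The function name category_percentages and the sample data are hypothetical.

```python
from collections import Counter


def category_percentages(category_strings):
    """Split comma-separated category strings and return each category's share in percent."""
    counts = Counter()
    for cats in category_strings:
        if not cats:  # some businesses have no categories
            continue
        counts.update(c.strip() for c in cats.split(","))
    total = sum(counts.values())
    return {cat: 100 * n / total for cat, n in counts.items()}


# Hypothetical category strings, for illustration only.
overall = ["Restaurants, Food", "Shopping", "Coffee & Tea, Food", "Restaurants"]
gta = ["Coffee & Tea, Food", "Restaurants, Coffee & Tea"]

overall_pct = category_percentages(overall)
gta_pct = category_percentages(gta)
# Pair the overall share with the GTA share for every category seen in the GTA sample.
top = {cat: (overall_pct.get(cat, 0.0), pct) for cat, pct in gta_pct.items()}
```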

After that, using the top dataframe, we drew a bar plot showing the 10 most frequent categories in the GTA and the percentage of businesses in each category overall and in the GTA.

The graph above shows the 10 most frequent business categories in the GTA and their percentages in the GTA and overall.

The percentages of the four categories associated with food and drinks (restaurants, food, bars, and coffee & tea) are all higher in the GTA than overall. This suggests that Toronto is a suitable place for food and drink businesses. (Question 3.1, more discussion below)

Nevertheless, so far we have only analysed the most frequent categories in the GTA. How do they compare to the most frequent categories overall?

Thirdly, we drew pie charts for both the top 10 most frequent categories in the GTA and overall.

The pie charts above demonstrate that restaurants, shopping, and food are the 3 most frequent categories both overall and in the GTA. In addition to these 3 categories, beauty & spas, nightlife, bars, and health & medical are also frequent both overall and in the GTA.

Overall, customers pay more attention to home services, local services, and automotive; in the GTA, customers spend more on coffee & tea, Chinese, and Event Planning & Services. A possible explanation is that, overall, people are more concerned about quality of life, so they focus more on different kinds of services and on automotive, while people in the GTA generally spend more on entertainment. (Question 2.2 & Question 3.1)

In particular, we noticed that although coffee & tea is not one of the 10 most frequent categories overall, it is among the 10 most frequent categories in the GTA, where it accounts for nearly 5% of the businesses in these 10 categories. Thus, starting a coffee business in Toronto can be a wise choice.

Where to Start the Coffee Business?

We have established that starting a coffee business in Toronto is a wise choice, but where in the city is the best location for our business? Let's investigate the choice of location for our coffee business.

Location of the Coffee Business

We assumed that the number of reviews is a way to quantify the popularity of a business, and plotted all businesses by their number of reviews on the maps below.

On each of the two maps, the red points represent businesses with more than 1000 reviews, the orange points represent businesses with 100-1000 reviews, the gold points represent businesses with 10-100 reviews, and the yellow points represent businesses with fewer than 10 reviews.

The first map shows businesses in the GTA. At the scale of the GTA, we discovered that businesses are concentrated in downtown Toronto and Markham, where the population is also concentrated. Moreover, businesses along the main arteries with convenient traffic, such as Yonge Street, Hwy 401, Hwy 404, and Hwy 407, tend to have more reviews.

In the second map, we narrowed the scope to downtown Toronto. We discovered that businesses are concentrated in the Entertainment District, and that businesses along Yonge Street, Bloor Street, College Street, Dundas Street, and Queen Street tend to have more reviews.

This discovery reconfirmed our claim that, in the GTA, businesses are more concentrated where the population is more concentrated, and tend to have more reviews along the main arteries where traffic is more convenient. (Question 3.3)

Therefore, as for the location of our coffee business, capital permitting, I would rent a shop front near an intersection of main roads in downtown Toronto.

Distance to Tim Horton's and Starbucks

Since we already know that the two coffee giants, Tim Horton's and Starbucks, are the top 2 franchises in Toronto, it is worthwhile to learn from their strategy before starting our own business. Thus, we looked into the locations of Tim Horton's and Starbucks.

We extracted the businesses of Tim Horton's and Starbucks together into the dataframe tim_star and separately into tims and stars, and plotted the locations of these businesses on the map below.

In the map above, the green points represent Starbucks businesses and the red points represent Tim Horton's businesses. (Question 3.4, more discussion later)

The distribution of the locations of these two coffee giants follows our earlier conclusion: they are concentrated in the downtown area and along the main roads and highways.

We are also interested in whether we should choose a shop front near these two coffee giants or far from them. Thus, we analysed the pattern in the distribution of these two coffee franchises.

Firstly, we investigated whether there is always a Starbucks near every Tim Horton's.

I converted the latitude and longitude to kilometers and, for every Tim Horton's, drew a square centered at its location with a side length equal to 2 * the given distance. If there is a Starbucks within the square, we say there is a Starbucks within the given distance of that Tim Horton's.
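
The square test can be sketched as follows. The conversion factor of about 111.32 km per degree of latitude, the cosine scaling of longitude, and the names to_km and has_neighbour_within are assumptions made for this illustration; the report does not state which conversion it used. Note that the corners of the square lie farther than the given distance from the center, so a square counts slightly more neighbours than a circle would.

```python
import math

KM_PER_DEG_LAT = 111.32  # approximate km per degree of latitude (assumption)


def to_km(lat, lon, ref_lat):
    """Project latitude/longitude to approximate kilometre coordinates near ref_lat."""
    x = lon * KM_PER_DEG_LAT * math.cos(math.radians(ref_lat))
    y = lat * KM_PER_DEG_LAT
    return x, y


def has_neighbour_within(shop, others, dist_km):
    """True if any point in `others` lies in the square of side 2 * dist_km centred on `shop`."""
    sx, sy = to_km(shop[0], shop[1], shop[0])
    for lat, lon in others:
        ox, oy = to_km(lat, lon, shop[0])
        if abs(ox - sx) <= dist_km and abs(oy - sy) <= dist_km:
            return True
    return False


# Hypothetical coordinates in Toronto, for illustration only.
tims = [(43.6500, -79.3800), (43.7000, -79.4000)]
starbucks = [(43.6505, -79.3805)]

# Share of Tim Horton's locations with a Starbucks within 100 meters (0.1 km).
share_100m = sum(has_neighbour_within(t, starbucks, 0.1) for t in tims) / len(tims)
```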

Under this definition, I drew 4 pie charts showing the percentage of Tim Horton's with a Starbucks within 100 meters, 250 meters, 500 meters, and 1000 meters, respectively.

We were surprised to discover that more than 45% of Tim Horton's have a Starbucks within 100 meters, nearly 75% have a Starbucks within 250 meters, nearly 90% have a Starbucks within 500 meters, and as many as 96.5% have a Starbucks within 1000 meters. Thus, we can conclude that a large share of Tim Horton's have a Starbucks nearby. (Question 3.4, continued)

Secondly, we calculated the distance between every Tim Horton's and the nearest Starbucks and plotted the distribution of the distances in the following boxplot.
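
One way to compute the distance to the nearest Starbucks is sketched below, using the haversine great-circle formula. This is an assumption for illustration: the report does not say which distance formula it used, and the function names and coordinates are hypothetical.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two latitude/longitude points."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def nearest_distance_km(shop, others):
    """Distance from `shop` to the closest point in `others`."""
    return min(haversine_km(shop[0], shop[1], lat, lon) for lat, lon in others)


# Hypothetical coordinates, for illustration only.
tim = (43.65, -79.38)
starbucks = [(43.66, -79.38), (43.70, -79.38)]
nearest = nearest_distance_km(tim, starbucks)
```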

From the boxplot, we noticed that there are several outliers in the distances. A common definition of outliers is values more than 1.5*IQR above the 75% quantile or more than 1.5*IQR below the 25% quantile. In order to get a clearer view of the majority of the data, we decided to remove these outliers.
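
The 1.5*IQR rule described above can be sketched with the standard library as follows. The function name remove_iqr_outliers and the sample distances are hypothetical, and the choice of the "inclusive" quantile method is an assumption.

```python
import statistics


def remove_iqr_outliers(values):
    """Keep values within [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if low <= v <= high]


# Hypothetical distances in meters, for illustration only.
distances = [50, 80, 120, 150, 200, 240, 300, 5000]
kept = remove_iqr_outliers(distances)
```

In this made-up sample only the 5000-meter value falls outside the fences and is dropped.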

Then, we drew a histogram of the distance from each Tim Hortons in the GTA to the nearest Starbucks.

From the graph, the distribution of the distances between Tim Horton's and Starbucks establishments is unimodal and strongly skewed to the right, with a peak in the range between 0 and 250. It is clear that the two coffee giants prefer to cluster together, possibly because this is the best strategy from a game theory perspective: they are competing for the best locations, and each finds it most valuable to stay level in the contest for market share. (Question 3.4)

Perhaps we should learn from the experience of these two coffee giants and choose to cluster together with them. However, we face a dilemma: the locations chosen by the coffee giants might be better overall, but the giants might also draw away some of our potential customers, given their large influence. Thus, let's analyse the choices of other coffee shops.

Thirdly, we similarly calculated the proportion of other coffee shops with a Tim Horton's or Starbucks within different distances, using the same method as before, and drew the pie charts below.

The pie charts demonstrate that more than 80% of coffee shops have a Tim Horton's or Starbucks within 100 meters, nearly 95% have one within 250 meters, and as many as 99% have one within 500 meters. Thus, we can conclude that most coffee shops choose to locate near these two coffee giants. Perhaps independent coffee shop owners believe that the decision-making teams of Tim Horton's and Starbucks are able to find the best locations, so they simply follow them when choosing shop fronts.

Therefore, we could simply follow this pattern and choose a location near a Tim Horton's or Starbucks as well.

How to Start the Coffee Business?

After determining where to start the coffee business, it is time to decide on the business strategy and product design. In order to learn from the experience of others and make better decisions, we analysed the characteristics of some successful coffee businesses.

Definition of Successful Coffee Business

We noticed that the business dataset contains two evaluation indexes for a business: stars and review_count. In order to derive a definition of success from these two indexes, we analysed the relationship between stars and review_count.

First, we extracted the two indexes from the original dataframe into a new dataframe df_star and grouped it by stars. Then, we used describe() to get a glimpse of the distribution of the data.
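
The grouped summary can be approximated without pandas as in the sketch below. It only mimics part of what describe() reports for each group; the function name review_summary_by_stars and the sample pairs are hypothetical, and each group needs at least two values for the quartiles to be defined.

```python
import statistics
from collections import defaultdict


def review_summary_by_stars(rows):
    """Group (stars, review_count) pairs by stars and summarise each group."""
    groups = defaultdict(list)
    for stars, review_count in rows:
        groups[stars].append(review_count)
    summary = {}
    for stars, counts in sorted(groups.items()):
        q1, median, q3 = statistics.quantiles(counts, n=4, method="inclusive")
        summary[stars] = {
            "count": len(counts),
            "mean": statistics.mean(counts),
            "25%": q1, "50%": median, "75%": q3,
            "max": max(counts),
        }
    return summary


# Hypothetical (stars, review_count) pairs, for illustration only.
rows = [(4.0, 3), (4.0, 13), (4.0, 43), (5.0, 3), (5.0, 5), (5.0, 10)]
summary = review_summary_by_stars(rows)
```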

Out[47]:
       review_count
       count    mean       std         min  25%  50%   75%   max
stars
1.0    4874.0   5.815552   12.027597   3.0  3.0  3.0   5.0   392.0
1.5    4976.0   15.596664  44.308352   3.0  3.0  7.0   14.0  1258.0
2.0    11426.0  15.108874  31.761224   3.0  4.0  7.0   15.0  1658.0
2.5    18842.0  20.911581  75.269496   3.0  3.0  7.0   18.0  4117.0
3.0    25996.0  30.857286  90.686141   3.0  5.0  11.0  28.0  3944.0
3.5    35008.0  40.681130  112.376081  3.0  5.0  11.0  36.0  6708.0
4.0    35969.0  56.523228  170.368055  3.0  5.0  13.0  43.0  8348.0
4.5    27301.0  43.444453  126.271964  3.0  6.0  12.0  33.0  5075.0
5.0    28216.0  12.113942  28.907074   3.0  3.0  5.0   10.0  1936.0

From the dataframe above, we noticed that, for each level of stars, even the third quartile is very small compared to the maximum number of reviews, which means there are extreme outliers at each level of stars. In order to get better visualizations, we chose to analyse the statistics that are affected by the outliers and those that are not separately.

First, since both the mean and the maximum are strongly affected by outliers, we drew the bar plots below for the mean and maximum number of reviews at each level of stars.

From the graph above, we noticed that both the mean and the maximum number of reviews are unimodal and skewed to the left across the levels of stars, with a peak at around 4.0 stars. (Question 2.4)

Second, we investigated the first, second, and third quartiles of the number of reviews at each level of stars, which are not affected by the extreme outliers. We drew the line chart for the three quartiles below.

From the graph above, we discovered that the three quartiles of the number of reviews are also unimodal and skewed to the left. The second and third quartiles both peak at 4.0 stars, while the first quartile is slightly larger at 4.5 stars.

This result is very similar to that of the mean and maximum. Thus, whether or not we consider the influence of the extreme outliers, businesses with around 4 stars tend to have the largest number of reviews. In addition, the number of reviews increases from 1 star to 4 stars, but businesses with 5 stars have few reviews.

Therefore, the number of reviews generally rises with the star rating up to about 4 stars. However, it is difficult to be perfect: as the number of reviews grows, it becomes hard to maintain 5 stars, so businesses with 5 stars are usually shops with a small number of good reviews. As a result, in the later analysis we define businesses with 4 stars as successful. (Question 2.4)

Attributes to Include

While deciding on the business strategy, we have to determine whether to offer certain attributes, such as bike parking, outdoor seating, and takeout, in our business. There are numerous attributes to consider. In this report, I take bike parking as an example. (Question 2.3)

First, we extracted the businesses with bike parking into a new dataframe BikeParking and split the categories listed in a single row into multiple rows. Then, we merged the total number of businesses in each category from Category_count into BikeParking_category and calculated the percentage of businesses with bike parking in each category.
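
The per-category percentage can be sketched as below, assuming each business is a dictionary with a comma-separated categories string and a boolean BikeParking flag. The helper names split_categories and bike_parking_share and the sample rows are hypothetical.

```python
from collections import Counter


def split_categories(categories):
    """Turn a comma-separated category string into a list of clean names."""
    return [c.strip() for c in categories.split(",")] if categories else []


def bike_parking_share(businesses):
    """Percentage of businesses with bike parking in each category."""
    total = Counter()
    with_parking = Counter()
    for b in businesses:
        for cat in split_categories(b["categories"]):
            total[cat] += 1
            if b["BikeParking"]:
                with_parking[cat] += 1
    return {cat: 100 * with_parking[cat] / n for cat, n in total.items()}


# Hypothetical rows for illustration; not taken from the dataset.
businesses = [
    {"categories": "Coffee & Tea, Food", "BikeParking": True},
    {"categories": "Coffee & Tea", "BikeParking": False},
    {"categories": "Restaurants, Food", "BikeParking": True},
    {"categories": "Restaurants", "BikeParking": False},
]
share = bike_parking_share(businesses)
```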

Then, we visualized our data in the bar plots below. The bar marked in red represents the category coffee & tea.

The bar plots above demonstrate that the category restaurants has the largest number of businesses with bike parking, followed by food, shopping, and beauty & spas. Also, we noticed that coffee & tea is one of the top 10 categories with the largest number of businesses. Moreover, among the top 10 categories, coffee & tea has the largest percentage of businesses with bike parking. (Question 2.3)

Thus, for coffee shops overall, bike parking might be useful. However, in order to succeed, we take a closer look at the successful examples.

Second, we extracted the businesses in the category coffee & tea into a new dataframe coffee_tea. Since we define businesses with a 4-star rating as successful, we drew the pie charts below for the percentage of businesses with bike parking in the category coffee & tea, both overall and among those with a 4-star rating.

From the pie charts above, we discovered that 64.4% of businesses in the category coffee & tea have bike parking, while among the businesses with a 4-star rating in this category, 70% have bike parking.

The share with bike parking is slightly larger among the 4-star businesses, so including bike parking in our coffee business is probably a wise choice.

Learning From Tim Horton's and Starbucks

Besides the coffee businesses with a 4-star rating, the characteristics that made Tim Horton's and Starbucks coffee giants are also instructive for us. This time, we make use of the reviews of these two franchises. (Question 4.2)

First, we added two new columns tim and star to the GTA_reviews dataframe, indicating whether each review is for Tim Horton's or Starbucks. Then, we extracted the text from the reviews of the two franchises and stored it in tims_reviews and stars_reviews respectively.

Also, we created a list of meaningless words, stopwords (partially copied from https://programminghistorian.org/en/lessons/counting-frequencies#frequencies and modified by myself), and then counted the number of appearances of each word in the review text, ignoring the words in stopwords. We saved the word counts for the reviews of Tim Horton's and Starbucks in tims_word and stars_word respectively, and created a word cloud for each of them.
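
The word-counting step can be sketched as follows. The stop-word list here is a tiny made-up sample rather than the list used in the report, and the tokenizing rule (lowercase letters and apostrophes) and the function name count_words are assumptions.

```python
import re
from collections import Counter

# A tiny illustrative stop-word list; the report used a longer, hand-edited one.
STOPWORDS = {"the", "a", "and", "is", "was", "to", "of", "i", "it", "in", "at", "this"}


def count_words(texts, stopwords=STOPWORDS):
    """Count lowercase word occurrences across review texts, skipping stop words."""
    counts = Counter()
    for text in texts:
        words = re.findall(r"[a-z']+", text.lower())
        counts.update(w for w in words if w not in stopwords)
    return counts


# Hypothetical review texts, for illustration only.
tims_reviews = ["The coffee was good and the staff is friendly.",
                "Good location, slow service at this location."]
tims_word = count_words(tims_reviews)
top_words = tims_word.most_common(3)
```

The resulting counter can be passed to a word cloud or bar plot; most_common(10) would give the kind of top 10 list discussed in this section.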

This is the word cloud for the reviews of Tim Hortons. It gives us an overall glimpse of the text in the reviews.

This is the word cloud for the reviews of Starbucks. It gives us an overall glimpse of the text in the reviews.

From the two word clouds, it seems that users tend to use similar language in the reviews of Tim Horton's and Starbucks. More specifically, in the reviews of both franchises, words like 'location' and 'coffee' appear again and again.

Next, let us quantify this observation in the bar plots below. The bar plots show the 10 most frequent words in the reviews of Tim Hortons and Starbucks respectively.

The bar plots demonstrate that the words used in the reviews of the two franchises are very similar. The words 'coffee', 'location', 'order', 'service', 'like', 'place', 'staff', and 'time' are mentioned very frequently in the reviews of both franchises, which suggests these aspects are important for all coffee franchises.

In the reviews of Tim Hortons, the word 'good' is mentioned frequently, while in the reviews of Starbucks, the word 'friendly' is mentioned frequently. Perhaps the staff of Tim Horton's are generally good, while the staff of Starbucks are notably friendly.

Also, 'food' is mentioned frequently in the reviews of Tim Horton's, while 'drinks' is mentioned frequently in the reviews of Starbucks. Perhaps Tim Horton's is better at food and Starbucks is better at drinks.

In addition, we are also interested in the reviews written by the subset of users who reviewed both establishments. Thus, we extracted the list of user_id in both tims_reviews and stars_reviews and counted the words they mentioned in their reviews. (Question 4.2)
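
Selecting the users who reviewed both franchises amounts to a set intersection on user_id, as in this minimal sketch. Reviews are assumed to be dictionaries with user_id and text keys; the function name users_reviewing_both and the sample reviews are hypothetical.

```python
def users_reviewing_both(tims_reviews, stars_reviews):
    """Return the user_ids that appear in both review collections."""
    tims_users = {r["user_id"] for r in tims_reviews}
    stars_users = {r["user_id"] for r in stars_reviews}
    return tims_users & stars_users


# Hypothetical reviews, for illustration only.
tims_reviews = [{"user_id": "u1", "text": "quick coffee"},
                {"user_id": "u2", "text": "busy location"}]
stars_reviews = [{"user_id": "u2", "text": "friendly staff"},
                 {"user_id": "u3", "text": "nice drinks"}]

both = users_reviewing_both(tims_reviews, stars_reviews)
# Keep only the reviews written by users who reviewed both franchises.
tims_subset = [r for r in tims_reviews if r["user_id"] in both]
stars_subset = [r for r in stars_reviews if r["user_id"] in both]
```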

Out[111]:
   word      tims_num  stars_num  num
0  location  419.0     686.0      1105.0
1  coffee    340.0     468.0      808.0
2  like      188.0     363.0      551.0
3  staff     187.0     325.0      512.0
4  place     175.0     319.0      494.0
5  service   204.0     255.0      459.0
6  good      181.0     229.0      410.0
7  time      159.0     246.0      405.0
8  people    158.0     247.0      405.0
9  order     194.0     201.0      395.0

The bar plot above shows the top 10 words mentioned by the subset of users who reviewed both establishments. We discovered that these reviewers tend to write more about Starbucks than about Tim Horton's.

We also repeated our earlier procedure, this time using only the reviews written by the subset of users who reviewed both establishments.

Unsurprisingly, the results are almost the same as those for all users. The reviewers tend to use similar language in the reviews of both Starbucks and Tim Hortons.

The words 'coffee', 'location', 'order', 'service', 'like', 'place', 'staff', 'time' are mentioned very frequently in the reviews of both franchises.

In the reviews of Tim Hortons, the word 'good' is mentioned frequently, while in the reviews of Starbucks, the word 'friendly' is mentioned frequently. Also, 'food' is mentioned frequently in the reviews of Tim Horton's while 'drinks' is mentioned frequently in the reviews of Starbucks.

Therefore, in our coffee shop, we have to pay more attention to the quality of coffee, the service of the staff, and the waiting time.

Considerations

  1. We discovered that a small group of users is responsible for most reviews. (See the pie charts below) (Question 4.1)

We separated out the outliers in the number of reviews written by each user. The outliers account for less than 14% of reviewers, but they wrote nearly 70% of the reviews. Thus, the reviews might only represent the opinions of this subset of customers and might not be representative of all customers.

  2. We are not able to automatically determine whether the reviewers were paid for writing reviews.
Out[806]:
business_id name address city state postal_code latitude longitude stars review_count is_open attributes categories hours modify modified_city BikeParking GTA modified_name coffee
23337 egLYFnycp8ktxMCvilFdLw Passport Photo 327B Spadina Avenue, Unit 203 Toronto ON M5T 2E9 43.654537 -79.398380 5.0 272 1 {'BikeParking': 'True', 'BusinessParking': '{'... Shopping, Photography Stores & Services {'Monday': '10:0-18:0', 'Tuesday': '10:0-18:0'... False toronto True True Passport Photo False
10824 g6AFW-zY0wDvBl9U82g4zg Baretto Caffe 1262 Don Mills Road Toronto ON M3B 2W7 43.744703 -79.346468 5.0 267 1 {'Ambience': '{'romantic': False, 'intimate': ... Restaurants, Italian, Cafes {'Monday': '7:30-18:0', 'Tuesday': '7:30-18:0'... False toronto True True Baretto Caffe False
3688 J9vAdD2dCpFuGsxPIn184w New Orleans Seafood & Steakhouse 267 Scarlett Road Toronto ON M6N 4L1 43.677744 -79.506248 5.0 90 1 {'GoodForMeal': '{'dessert': False, 'latenight... Steakhouses, Cajun/Creole, Restaurants, Seafood {'Tuesday': '17:0-21:30', 'Wednesday': '17:0-2... False toronto True True New Orleans Seafood & Steakhouse False
29070 WuH8ncHXNBvAna7t-BX7xg Step Up Massage & Rehab - Adelaide 218 Adelaide Street W, Suite 200 Toronto ON M5H 1W7 43.648681 -79.387424 5.0 84 1 {'AcceptsInsurance': 'True', 'BikeParking': 'T... Naturopathic/Holistic, Acupuncture, Doctors, H... {'Monday': '0:0-0:0', 'Tuesday': '12:0-20:0', ... False toronto True True Step Up Massage & Rehab - Adelaide False
30086 8vmuInQkTrRgCxlxo5y_mw Famik Esthetics Markham ON L3S 43.850933 -79.262029 5.0 81 1 {'WheelchairAccessible': 'False', 'ByAppointme... Skin Care, Beauty & Spas, Hair Removal, Waxing None False markham True True Famik Esthetics False

For instance, in the dataframe above, we sorted the businesses in the GTA with 5.0 ratings by review_count and discovered that most of them have fewer than 100 reviews, but the top 2 have more than 200 reviews each. How did these business owners motivate customers to write reviews for them: by extremely impressive service, or simply by money? We cannot know. (Question 4.4)

  3. Our analysis is based only on the Yelp dataset. Franchises that do not use Yelp for their business are ignored, so some conclusions might not be solid.

  4. When using the fuzzywuzzy library, it is possible that we unified the names of different cities or franchises, and that we missed some matching pairs.

  5. The sample of businesses in the category coffee & tea with a rating of 4.0 stars is not big enough, so the conclusions based on it might not be accurate.
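
For the first consideration above, the split between outlier reviewers and everyone else can be sketched as follows. It assumes the same 1.5*IQR rule used earlier for distances, applied to the number of reviews per user; the report does not state the exact rule, and the function name outlier_reviewer_shares and the sample user ids are hypothetical.

```python
import statistics
from collections import Counter


def outlier_reviewer_shares(user_ids):
    """Return (share of users, share of reviews) for users above Q3 + 1.5 * IQR."""
    per_user = Counter(user_ids)  # number of reviews written by each user
    q1, _, q3 = statistics.quantiles(per_user.values(), n=4, method="inclusive")
    cutoff = q3 + 1.5 * (q3 - q1)
    heavy = {u: n for u, n in per_user.items() if n > cutoff}
    return len(heavy) / len(per_user), sum(heavy.values()) / len(user_ids)


# Hypothetical review log: seven users with one review each, one user with thirteen.
user_ids = [f"u{i}" for i in range(7)] + ["power_user"] * 13
user_share, review_share = outlier_reviewer_shares(user_ids)
```

In this made-up sample, one user out of eight writes 13 of the 20 reviews, which shows how a small group can dominate the review text.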